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Abstract 

Existing computationally efficient methods for penalized likelihood GAM fitting employ iterative smooth- 
ness selection on working linear models (or working mixed models). Such schemes fail to converge for a 
non-negligible proportion of models, with failure being particularly frequent in the presence of concurvity. If 
smoothness selection is performed by optimizing 'whole model' criteria these problems disappear, but until now 
attempts to do this have employed finite difference based optimization schemes which are computationally in- 
efficient, and can suffer from false convergence. This paper develops the first computationally efficient method 
for direct GAM smoothness selection. It is highly stable, but by careful structuring achieves a computational 
efficiency that leads, in simulations, to lower mean computation times than the schemes based on working-model 
smoothness selection. The method also offers a reliable way of fitting generalized additive mixed models. 

Keywords: AIC, GACV, GAM, GAMM, GCV, penalized likelihood, penalized regression splines, stable com- 
putation. 



1 Introduction 

There are three basic approaches to estimating the smoothness of GeneraHzed Additive Models within a penaHzed 
likelihood framework. The first is to develop an efficient GCV (Craven and Wahba, 1979) or AIC (Akaike, 1973) 
based smoothness selection method for the simple (i.e. non-generalized) additive model case (e.g Gu and Wahba, 
1991, for smoothing spline ANOVA models; see also Mammen and Park, 2005, for backfit GAMs), and then to 
apply this method to each working additive model of the penalized iteratively re-weighted (P-IRLS) scheme used 
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Figure 1 : Presence (•) absence (+) data for mackerel eggs off the west coast of France. The data are based on the 
larger data set discussed in section l4~4l There are substantial problems fitting a GAM to these data with existing 
methods. 

to fit the GAM. This was initially proposed by Gu, (1992) for generalized smoothing spline ANOVA models: he 
termed the approach 'Performance Oriented Iteration' . Wood (2000) extended the method to computationally effi- 
cient low rank GAMs based on penalized regression splines, while Wood (2004) developed it further by providing 
an optimally stable smoothness selection method for simple additive models. The second approach is to treat the 
GAM as a generalized linear mixed model (see e.g. Ruppert et al. 2003), so that smoothing parameters become 
variance components. In the non-generalized case these variance components can be estimated by maximum like- 
lihood or REML, but in the generaUzed case methods based on iterative fitting of working linear mixed models are 
used (e.g. Breslow and Claytons's, 1993, 'PQL' approach). There can be no guarantee of convergence for methods 
based on iteratively selecting the smoothness of working linear (mixed) models, so the third approach avoids this 
problem by selecting smoothing parameters directly, based on AIC or GCV for the actual model: it goes back at 
least as far as O'SuUivan et al. (1986). Gu's (2004) gss package has an implementation of this, based on work 
by Kim and Gu (2004), and Wood (2006) attempted to extend an earlier performance oriented iteration method 
(Wood, 2004) in this direction. The difficulty with the direct approach is that its extra non-linearity makes efficient 
and stable methods difficult to develop: in consequence, the Gu (2004) and Wood (2006) methods are based on 
inefficient finite differencing based optimization, which is not always reliable. Figures [T] and |2] show data sets for 
which these three existing approaches fail. 

Figure[T]shows data on the presence or absence of Mackerel eggs from a sub region of a survey where absences 
are sufficiently common that it might be worthwhile to model presence/absence before attempting to model abun- 
dance, the real quantity of interest. Covariates longitude, latitude, distance from the continental shelf edge, sea bed 
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Figure 2: Simulated data with a serious concurvity problem. Response y depends on covariates z, x and d. 
Existing methods have difficulty fitting an appropriate GAM to these data. 

depth, surface temperature and temperature at 20m depth are available, and in the absence of more detailed prior 
knowledge a generaUzed additive model: 

\og\t{E{pi)} = /i(loni, lati) + /2(c.disti) + /3(b.depth.) + /4(t.surf i) + /5(t.20iiii) (1) 

might be appropriate, where pi is 1 for presence and for absence of eggs. A BemouilU distribution for pi is 
assumed. A tensor product of two rank 8 penalized regression splines should be more than sufficiently flexible for 
/i, while the remaining terms can be represented by rank 10 penalized regression splines. Performance oriented 
iteration diverges when attempting to fit this model. PQL based model fitting either fails to converge or 'converges' 
to a clearly sub-optimal 'completely smooth' model, depending on numerical details (if treated in the same way as 
a standard penalized likeUhood based model, the completely smooth model would have an AIC around 84 larger 
than the AIC optimal model). Direct optimization of the whole model (generaUzed) AIC also fails, using either a 
pure finite difference based optimization, or Wood's (2006) approach based on finite differencing only for second 
derivatives. In these cases careful examination of the optimization results does indicate the possibihty of problems, 
and estimated minimum AICs are approximately 20 above the actual optimum. All computations were performed 
using R 2.4.0. with GAM setup based on the mgcv package and mixed model estimation based on the nlme and 
MASS packages. Direct optimization was performed using the nlm routine (results from optim generally being 
substantially more problematic). 

Data sets that cause problems for existing GAM estimation and smoothness selection methods are not hard to 
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find, and the root cause is often some form of concurvity (i.e. the presence of covariates which are themselves well 
modelled as smooth functions of other covariates). Often such problems are difficult to isolate, but figure |2] shows 
simulated data for which the problem is clear. PQL and POI based methods both fail when used to fit the model: 

\ogit{E{yi)} = fi{x„ Zi) + f2{di), Bernoulli. (2) 

to these data, /i and /2 were represented using thin plate regression splines of basis dimension 30 and 10, with 
additional shrinkage so that a large enough smoothing parameter can shrink the function to zero (see Wood, 2006, 
4.1.6). PQL diverges until the linear mixed model estimator fails, while POI simply fails to converge. Again, 
direct smoothness selection using general purpose optimizers and finite difference derivatives fails (substantially) 
to locates the AIC optimal model. This latter failure occurs whether or not extra help in the form of exact first 
derivatives is supplied. 

It should be noted that the impact of concurvity in these examples is different in kind to the well publicized 
difficulties discussed by Ramsay, Burnett and Krewski (2003) in which concurvity can cause backfitting approaches 
to GAM estimation (as in Hastie and Tibshirani, 1990) to substantially underestimate estimator variances. For the 
direct GAM fitting approach, discussed here, the issue is reliably estimating the model in the presence of concurvity 
driven ill-conditioning. Once the model is estimated the corresponding variance estimates will automatically take 
into account the effect of the concurvity, so that variance correction approaches of the sort discussed, for example, 
in Figueiras, Roca-Pardinas and Cadarso-Suarez (2005) are not needed (and might actually be counterproductive). 
In other words, if the computational difficulties caused by concurvity can be solved for the direct fitting approach, 
then we avoid the major part of concurvity driven variance bias 'for free' . 

For applied use of GAMs these convergence failures are obviously problematic and the aim of this paper 
is to develop methods that eliminate them to the maximum extent possible. General fitting methods for GAMs 
should simply work (much as we expect of GLM fitting routines), without the need for tuning, otherwise the fitting 
methods get in the way of practical modelling. So the objective here is to produce the most reliable method possible 
for penaUzed likelihood based GAM estimation with AIC or GCV type smoothness selection. This suggests: 

1. The method must be 'direct' in the sense already defined, so that fixed optimum of the smoothness selection 
criteria exist. 

2. For maximal reliability, optimization of the criteria should be based on a full Newton method, utiUzing exact 
first and second derivatives, not approximations. 

3. The method must be able to deal with any linear degeneracy in the model, such as that caused by severe 
concurvity, or by the heavy smoothing that is appropriate in many practical modelling situations. 
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In addition the method must meet the practical consideration of being computationally efficient, and, given the 
non-linearities imposed by the direct approach, point 2 presents the main challenge in this regard. None of the 
forgoing aims are very difficult to achieve if the method is allowed an operation count that is cubic in the number 
of data, but such an expensive scheme would be of no practical interest. 

Faced with the task of producing such a method it might be tempting to abandon penalized likelihood in favour 
of a fully Bayesian MCMC approach (e.g. Fahrmeir and Lang, 2001; Fahrmeir, Kneib and Lang, 2004 or Brezger 
and Lang, 2006), but this is not always the answer, and can make problems harder to detect. Firstly, convergence 
problems can often become mixing problems. For example, using the data in figure [T] MCMC simulation with 
the mackerel model (HJ gives markedly reduced acceptance rates (minimum down to 10% from 75%), increased 
between chain variability and appears to require increased burn in, relative to similar models for data simulated 
without serious concurvity problems. Computations were performed using the state of the art BayesX package 
(Brezger, Kneib and Lang, 2007), so the model representation was slightly different to that used for the other 
computations in that the Bayesian P-splines of Lang and Brezger (2004) were used to represent the smooths com- 
ponents. The second issue with MCMC methods is that computational feasibility requires that the prior covariance 
matrix (smoothing penalty matrix) is sparse (this is the consideration that drives the choice of P-splines in BayesX). 
Many practically useful smoothers do not have this property (e.g. thin plate splines, as used in model (|2]l). In the 
absence of sparsity then the computational cost is of the order of the cube of the largest (non-sparse) smoothing 
basis dimension, multiplied by the chain length. For smooths with a simple prior covariance matrix structure, then 
in principle this cost can be reduced to the square of the largest basis dimension, multiplied by the number of 
steps of the chain actually used for posterior inference. Some form of Demmler-Reinsch orthogonalization (see 
e.g. Wood, 2006, 4.10.4) is what is needed to achieve this. However, such orthogonalization has yet to be tried 
in practice, may lead to more difficulties in setting the hyper-prior on smoothing parameters, and can not be done 
at all for the kind of penalty required in order to ensure scale invariance in smooth interaction terms (e.g. Wood, 
2006, 4.1.8 or Wahba 1990, Chapter 10). Situations in which quasi-likelihood is appropriate are also awkward to 
handle in an MCMC context. On the other hand the Bayesian MCMC approach improves what can be done with 
non smooth random effects, and is usually the best option when large numbers of such random effects are required. 

The remainder of this paper is structured as follows. Section 2 reviews essential background and discusses 
smoothness selection criteria. Section 3 proposes a method for efficient and stable optimization of such criteria, 
and hence for GAM fitting. Section 4 illustrates the comparative performance of the new method using simulated 
and real data, including a GAMM example. 
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2 GAM estimation and selection 



Generalized additive models (Hastie and Tibshirani, 1986) are generalized linear models (Nelder and Wedderburn, 
1972) in which the linear predictor is partly composed from a sum of smooth functions of some or all of those 
covariates. Hence the basic model structure is 

g{E{y,)} = X*e + J2fi{^j) (3) 

i 

where the yi are observations on independent random variables from some exponential family distribution, or 
faihng that, have a mean variance relationship appropriate for use of a quasi-likelihood approach to inference, g 
is a smooth monotonic 'link' function. X* is the i^^ row of the model matrix for any strictly parametric model 
components, and 9 is the corresponding parameter vector The fj are smooth functions of covariates Xj, which may 
be vector covariates. The fj are subject to identifiability constraints, typically that J^i fj i^ji ) = V j. Sometimes 
each smooth function may also be multiplied by some covariate, yielding a 'variable coefficient' model (Hastie 
and Tibshirani, 1993): the extension is trivial to handle in practice (as recognized by Eilers and Marx, 2002; it 
has also been available in R package mgcv since early 2002). The model can further be extended to include extra 
random effect terms to arrive at the generalized additive mixed model (GAMM, eg. Lin and Zhang, 1999). The 
Unk between penalized regression and mixed modelhng that lets GAMs be estimated as GLMMs also means that 
GAMMs can be estimated by the methods discussed here (see section 1431 1. 

The first step in GAM estimation is to represent the smooth terms in (O using bases with associated penalties 
(see, e.g., Marx and Eilers, 1998; Wood, 2006). Each smooth term is represented as 

k=l 

where the bjk {xj ) are known basis functions, chosen to have convenient properties, while the (3jk are unknown co- 
efficients, to be estimated. Associated with each smooth function is one or more measures of function 'wiggliness' 
0JSj(3j, where Sj is a matrix of known coefficients. Typically the wiggliness measure evaluates something like 
the univariate spline penalty J fj{x)^dx or its thin-plate spline generalization, but it may also be more complex, 
such as tensor product smooth penalty with multiple /3jSj/3j terms, (e.g. Wood, 2006, section 4.1.8). Intermedi- 
ate rank basis- penalty smoothers of this sort go back at least as far as Wahba (1980) and Parker and Rice (1985). 
Hastie and Tibshirani (1990, section 9.3.6) discussed using them for GAMs and O'Sullivan (1986) demonstrated 
their use in a wide variety of problems. 

Given bases for each smooth term, the GAM, (O, can be re-written as a GLM, g{E{yi)} = Xi/3, where X 
includes the columns of X* and columns representing the basis functions evaluated at the covariate values, while 
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j3 contains 6* and all the smooth coefficient vectors, j3j . The fit of this GLM is most conveniently measured using 
the deviance: 

D{(5) = 2{^^ax - m}4> 

where I is the log-likelihood, or log-quasi-likelihood of the model, and Zmax is the maximum possible value for 
I given the observed data, which is obtained by considering the MLE of a model with one parameter per datum 
(under which the model predicted E{yi) is simply yi). is a scale parameter, and the definition of D means that 
it can be calculated without knowledge of 4>. Maximizing the (quasi-) likelihood is equivalent to minimizing the 
deviance, and in several ways the deviance behaves rather Uke the residual sum of squares in Unear modeUng (see 
McCullagh and Nelder, 1989 for further details). 

If the bases used for the smooth functions, fj, are large enough to be reasonably sure of avoiding mis- 
specification, then the model will almost certainly overfit if it is estimated by minimizing the deviance. For this 
reason GAMs are estimated by minimizing 

3 

where the \j are smoothing parameters and the Sj are the Sj suitably padded with zeroes so that l3^Sj/3 = 
PjSj(3j. For later notational convenience, define S = J2j ^j^j- The Xj control the smoothness of the component 
smooth functions. Smoothness selection is about choosing values for the Xj. 

Given values for the Xj, the penalized deviance can be minimized by penalized iteratively re- weighted least 
squares (P-IRLS, see e.g. Wood, 2006, for one derivation, and Green and Silverman, 1994, for more information 
on penaUzed UkeUhood and GLMs). Let V{fi) be the function such that var(?/j) = V{fjbi)(j). V can be written down 
for aU exponential family distributions, and is always available if using quasi-UkeUhood. Let u)i denote any prior 
weights on particular data points (used to weight the component of deviance attributable to each datum). Then 
iterate the following steps to convergence. 

L Using the current /ij estimate, evaluate the weights, Wi = ujy^V{^i)~^/'^ /g'{^i), and the pseudodata, 
Zi = g'{lJ.i){yi - IJ.i)+riu where rn = g{tii) (the 'Unear predictor'). 

2. Let W be the diagonal matrix of Wi values. Minimize the penaUzed least squares objective 

||W(z-X^)f +^A,/3Ts,/3 (4) 



w.r.t. /3 to find the next estimate of /3, and hence of r; = X/3 and jii = g~^{rii). 
The iteration can be initialized by setting /tj = yi (with adjustment to avoid infinite %). Divergence is rare, but can 
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be dealt with by step halving (provided an MLE exists). At convergence the parameter estimates, (3, minimize the 
penalized deviance. 

Note that this direct fitting approach makes it straightforward to directly estimate coefficient variances (see 
e.g. Wood, 2006, section 4.8) thereby completely sidestepping the well publicized problem of concurvity driven 
variance underestimation, that can affect backfitting methods of GAM fitting (see Ramsay, Burnett and Krewski, 
2003, for example). 

2.1 Smoothness selection 

Performance oriented iteration (POI) uses GCV or Mallows' Cp (Mallows, 1973) applied to each fitting problem 
(|4|l in order to select smoothing parameters. This often converges to fixed /3, A, but less often it diverges, or cycles, 
with failure being particularly frequent for binary data (see section|4]or the Introduction). Mixed model alternatives 
such as PQL are no better. An alternative, which avoids this fundamental convergence issue, is to base smoothness 
selection on a criterion applied to the GAM itself and evaluated at convergence of the P-IRLS. 

If T denotes the effective degrees of freedom of the penalized fit, then one could seek to minimize the general- 
ized AIC: 

Va(A) = Z?(/3) + 27r 
in the case where (f) is known, or the generalized GCV score 

Vg{X) = nD{$)/{n~-fTf 

otherwise (see Hastie and Tibshirani, 1990, Section 6.9 or Wood, 2006, section 4.5). r = tr (A) where A = 
WX(X^W'^X + S)^^X^W is the 'influence matrix' of the fitted model (W is evaluated at convergence). 7 is 
an ad hoc tuning parameter, sometimes increased from its usual value of 1 in order to obtain smoother models than 
would otherwise be selected (7 can itself be chosen automatically by, e.g., 10-fold cross validation, but this will 
not be pursued further here). 

Another alternative, proposed by Xiang and Wahba (1996) and Gu and Xiang (2001) is Generalized Approx- 
imate Cross Validation, GACV (see also Gu, 2002, section 5.2.2, for a clear exposition). It was initially derived 
for the situation in which only the canonical link function is used, but the restriction can be relaxed (in the process 
providing one possible justification for Va and Vg). Some modification of Gu and Xiang's (2001) approach is the 
key, as outlined next. 

The basic idea is that the Kullback-Leibler distance depends on the model only through the 'predictive de- 
viance' of the model, which can be estimated by some version of leave-one-out cross validation. The resulting 
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leave-one-out cross validation criterion is then replaced with a generalized cross vahdation criterion. To this end, 
first write the model deviance as .0(17) = ^Di{fji), where Di is the contribution to the deviance associated with 
the i"^ datum. Now the mean predictive deviance of the model can be estimated by 

n 

where t)!"*! is the linear predictor obtained by estimating the model from all the data except the i'^ datum. Min- 
imization of Dcv is an attractive way of choosing smoothing parameters as it seeks to minimize the KL distance 
between the estimated model and the truth; however it is an impracticaUy expensive quantity to attempt to minimize 
directly. 

To progress, foUow Gu and Xiang (2001) and let Tyl"'! be the hnear predictor which results if Zi is omitted 
from the working linear fit at the final stage of the P-IRLS. This can be shown to imply that rji — fi\~^^ = {zi — 
r]i)Aii/{l - An). But zi-fii = g'{jli){yi - AO- so fn - ^7!"'' = g'iMiVi - A0^"/(1 - ^»0- Now take a first 
order Taylor expansion 



Noting that 



AWj-'i) - DM) + ^{f^ - Vi) = DM) - ||rr^^'(Ai)(2/^ - AO- 



= "^Wi— — — — ■, we have A Wi ) - DM) + 2- —u>i 



dfli 'V{fii)g'{fiiy ' 1-Au ' Vifii) • 

Using the same approximation employed in the derivation of GCV, the individual An terms are replaced by their 
average, tr (A) / n, to yield, 

DMi ) - DM) + 2 nT\^i~UT^^' 

n-tr(A) V{iJ,i) 

and averaging over the data gives a GACV score 

where P = J2i ^iiVi ~ /V{fii) is a 'Pearson statistic'. (The final term on the RHS might also be multiphed 
by 7 > 1 of course.) Although the basic motivation and approach comes directly from the cited references, the 
need to accommodate non-canonical links means that the final score differs a httle from Gu and Xiang (2001) in 
the terms in the final summation. 

Notice how V* is just a linear transformation of (generalized) AlC, with (j) = P{fi) /{n — tr (A)} in place of 
the MLE of 4>, and tr (A) as the model degrees of freedom: so if (p is known we might as well use Va- Viewed 
from this perspective there is also no reason not to use D{fj)/{n — tr (A)} for (j), in which case, for tr (A) <^ n, 
the resulting criterion would be approximately Vg. Of course these cormections are unsurprising: see Stone (1977). 

The next section discusses how best to optimize these criteria with respect to the smoothing parameters. 
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3 Stable fitting and V optimization 

Optimization of the V type criteria is basically hierarchical. The criteria are optimized with respect to the smooth- 
ing parameters, with any set of smoothing parameters implying a particular set of coefficient estimates 0, which 
are found by an 'inner' P-IRLS iteration. 

The dependence of Vg, V* and Va on the smoothing parameters is via D{0), r and possibly P{0), so that the 
key to successful V optimization is to obtain first and second derivatives of D{0), t and P(/3) with respect to the 
log smoothing parameters, pj ~ \og(Xj), in an efficient and stable way (logs are used to simplify optimization, 
since the Xj must be positive). Given these derivatives, the derivatives of the criteria themselves are easily obtained, 
and they can be minimized by modified Newton, or quasi-Newton methods to select smoothing parameters (e.g. 
Gill et al., 1981, Dennis and Schnabel, 1983). 

The required derivatives in turn depend on the derivatives of /3 with respect to p. Conceptually, these can be 
obtained by differentiating the P-IRLS scheme, and updating derivatives alongside the parameters as the P-IRLS 
progresses. However, while this is the way to obtain expressions for the derivatives, it is a poor way to arrange the 
computations (a prototype ^rs-f derivative based scheme of this sort is proposed in Wood, 2006, but is built on the 
method of Wood, 2004, making it both inefficient and difficult to extend to a second derivative scheme). Instead, 
for fixed p, 

1 . Iterate the P-IRLS scheme to convergence of the /3, ignoring derivatives and using the fastest available stable 
method for solving each P-IRLS problem. 

2. At convergence, with the weights, pseudodata and hence all matrix decompositions now fixed, iterate the 
expressions for the derivatives of the coefficient, /3, with respect to the log smoothing parameters p to 
convergence. 

3. Evaluate the derivatives of r = tr (A) with respect to p. 

Relative to accumulating derivatives alongside the P-IRLS, the method has a number of advantages. Firstly, the 
basic matrix decompositions and some other component matrix expressions stay fixed throughout the derivative 
iteration, reducing the cost of using the most stable decompositions for the derivative calculations, and avoiding re- 
calculation at each step of the P-IRLS. In addition, fewer steps are typically required for the derivative iteration than 
for the P-IRLS itself, thereby saving the cost of several derivative system updates, relative to parallel accumulation 
of derivatives and estimates. Purely practically, the derivative update becomes a separate 'add-on' to the P-IRLS 
iteration, which simpUfies implementation. The following subsections explain how the method works in more 
detail. 
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3.1 Iterating for derivatives of /3 

At any step of the P-IRLS, let B = (X^W^X + S)-ixTW and z' = Wz so that /3 = Bz'. By differentiating 
the P-IRLS presented in section|2l the following update algorithm results. 

Initialization: z is fixed at its converged value from the P-IRLS, and its derivatives w.rt. p are initially set to zero. 
The corresponding initial derivatives of (3 are then given by 

9/3 aB , , a^B , 

— -z and — 



dpk dpk dpkdpm dpkdpm 

where the derivatives of B are evaluated with all the T^, and Tkm terms defined in section [3721 set to zero. 
Iteration: The following steps are iterated to convergence (for all fc, m, such that k > m). 



1. Update 

dz' 



and 



dpk dpkdpm 
using the current derivatives of 0, as described in appendix A. 

2. Using the z' derivative from step 1, the derivatives of /3 are updated as follows, 

dB dB , ^dz' , 0^0 d^B , dB dz' dB dz' „ d^z' 
- -z +B- — and - — ^— = - — - — z' + - — — hB^ 



dpk dpk dpk dpkdpm dpkdpm dpk dp,n dp,n dpk dpkdpm 

Note that while B is fixed, its derivatives will change as the iteration progresses. 

Convergence of the iteration would usually be judged by examining convergence of the the first and second 
derivatives of the deviance with respect to p. Calculation of these is routine, given the derivatives of (3: the 
expressions are provided in Appendix B, and are also needed for obtaining the derivatives of the smoothness 
selection criteria themselves. 

3.2 Computing with the derivatives of B 

The /3 derivative update scheme involves derivatives of B, which need to be spelled out. Initially expressions for 
the derivatives will be obtained in terms of B, A = WXB, G = X^W^X + S and the diagonal matrices: 

m ,• / dwi 1 \ , ,. / d'^Wi 1 dwi dwi 1 

Tfe = diag and Tkm = diag 



dpkWiJ " °\dpkdpmWi dpkdpmwj^ 

(see appendix A for the derivatives of Wi). Noting that dG~^/dpk = -2BTfeB''" - e^'^G^^SfeG-^ the first 
derivative of B is 

dB 

— = -2BTfeA - eP^G-^SkB + BTk, 
dpk 
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while 



= ~2— — TkA — 2BTfeTOA — 2BTfe— h — — Tfe + BTfe^ 



OpkOPm Opm Opm Opn 



-S,B + G-^S,!?- ) - d'Le^^G-'S.B 



\ Opm Opn 

(where = 1 if m = A; and otherwise). Direct naive use of these expressions would be both computationally 
expensive (justforming A has O(n^g) cost) and potentially unstable, since it would not address the ill-conditioning 
problems that can be a feature of GAMs, especially when there is concurvity present. It is therefore necessary to 
develop a way of computing with these derivatives which is both maximally stable and keeps computational costs 
down to a small multiple of 0{nq^), the leading order cost of the P-IRLS. 

There are a number of more or less equivalent starting points for such a method, of which the following two 
are the most straightforward. 

1 . Find the Choleski factor, L, such that 

L^L = XW^X + S. 

This should actually be performed with pivoting (available in UNPACK, Dongarra et al. 1978), in case of 
co-linearity problems, and the pivoting appUed expUcitly to the colunms of X. The rank, r, of the pivoted 
Choleski factor L can then be estimated, by finding the largest upper left sub-matrix of L with acceptably 
low condition number (e.g. Golub and van Loan, 1996, section 5.5.7). L is upper triangular, and efficient 
and reliable estimation of the condition number of triangular matrices is fairly straightforward following 
Cline et al. (1979). If rank deficiency is detected then all but the r upper left rows and coluimis of L are 
dropped along with the corresponding colunms of X (for the current iteration only, of course). Now define 
q X r matrix P, the first r rows of which are given by with the remaining rows being zero. Also define 
n X r matrix K = WXL~^. 

2. A somewhat more stable approach first finds a square root E of S, so that E^E = S. A pivoted Choleski 
decomposition or (synnmetric) eigen-decomposition can be used to do this. Next form the QR decomposition 

/ 




where Q is a matrix with orthogonal (strictly orthonormal) columns and R is upper triangular. Again this 
should be performed with pivoting (Golub and van Loan, 1996), which must subsequently be applied to 
the colunms of X. An LAPACK routine was used for the decomposition (Anderson et al. 1999). Rank 
deficiency can be dealt with in exactiy the same way as it was for L. Again the redundant columns of Q and 
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rows and columns of R are dropped. Now let n x r matrix K be the first n rows of Q so that WX = KR, 
and define q x r matrix P as the matrix with first r rows given by R~^ and remainder packed with zeroes. 

K is formed explicitly, while P can either be formed expUcitly or computed with using its definition. K and 
P are all that are required from the decompositions for subsequent computation. Although the values taken by 
these two matrices depend on method of calculation, they are used in the same way irrespective of origin. Of the 
two calculation methods, the second, QR based, method is usually to be preferred over the first, since it avoids 
the exacerbation of any numerical ill-conditioning that accompanies explicit formation of XW^X, and instead 
is based on a stable orthogonal decomposition. The Choleski based method is faster, by about a factor of 2, but 
irreducible costs of the second derivative calculations, which are the same for both methods, substantially dilute 
this advantage in practice. A singular value decomposition (see Golub and van Loan, 1 996 or Watkins, 1991) based 
method is also possible, but is more costly and is not pursued here. 

Note that for the most part the pivoting used in either method does not effect the subsequent algorithm: it is 
simply that quantities such as the estimated coefficient vector and its covariance matrix must have the pivoting 
reversed at the end of the method. The only exception is the one already mentioned, that the pivoting will have to 
be applied to the columns of X before it can be used as part of the derivative updating iteration. 

It is now straightforward to show that = PP^ (strictly a sort of pseudo-inverse if the problem is rank 
deficient), B = PK^ and A = KK^, and some work establishes that 

— = -2PKTTfcKK^ - e'''=PP^SfePK^ + PK^T^ 

opk 

and 

= 4PK^T„KK^TfeKK^ + 2e''"'PP^S„PK^TfeKK^ - 4PKTT™TfcKK^ 



dpkdpn 

- 2PK^Tfc„KK^ + 4PK^TfcKK^T„KK^ + 2e'''"PK^TfcKP^S„PK^ 
- 2PKTTfcKK'rT„ - 2PKTT„KKTTfc - e^'-PP^S^PK^Tfe + PK^T^T^ 

+ PK"^Tfo„ + 2e'"=PK"^T„KP"^SfcPK"^ + e'"=e''"' PP"^S„PP"^SfcPK"^ 
+ 2e'"=PP"^SfcPK'rT„KK"r + e^''e^-PP"^SfcPP"^S„PK"r - e^'^PP'^SfePR'^T^ 

- 5^e'"=PP'^SfcPK"r. 

By inspection of the preceding two equations, it is clear that, given the one off 0{nq^) start up cost of forming 
K, the multiplication of a vector by either derivative of B is 0{nq). i.e. the leading order cost of computing the 
smoothing parameter derivatives of /3 and hence of the deviance or Pearson statistic has been kept at 0{nq^). 

13 



3.3 The derivatives of tr (A) 

Once the derivative iterations detailed in the previous sections are complete, it is necessary to obtain the derivatives 
of the effective degrees of freedom tr (A). These are 

dtr (A) 
dpk 

and 



= tr (TfcA - 2ATfeA - e'"'B'^SkB + ATk) 



= 2tr (Tfe^A + 2TfcT„A) - 4tr (TfcAT,„A + T^AT^A) 

OPkOPm 

- 2tr (2AT™TfeA + ATfc„.A) + 8tr (AT^AT^A) + 4tr (e^'" ATfeB^S^B + e^-" AT„B^SfcB) 

- tr {2eP"'TkB'^Srn'B + 2e'"=T„B'rSfcB + S'^eP'B'^SkB) + 2e'''"e'"=tr (B'^S^G-^SfeB) . (5) 

Obviously it would be hideously inefficient to evaluate the terms in Q by explicitly evaluating its various 
component matrices and then evaluating the traces. Rather, efficient calculation of the trace terms rests on: (i) 
the fact that evaluation of diag(CH), where C is ri x p, and H is p x n, takes np floating point operations, 
(ii) tr (CH) = tr (HC), (iii) careful choice of what to store and in what order to calculate it and (iv) the use 
of 'minimum column' square roots of the penalty matrices S,„. The actual evaluation uses the matrices K and 
P, defined in section |3^ and is detailed in appendix C. The leading order cost of the evaluation is 
operations, where M is the number of smoothing parameters. This is a considerable saving over finite differencing 
for second derivatives. 

Demonstrating that computing with K and P is as computationally efficient as is possible actually requires 
that the derivatives of B and tr (A) be written out in terms of the original matrix decomposition components, and 
the most efficient computation of each term then be considered. The minimum set of quantities required for the 
whole calculation is then assembled, at which point it becomes clear that maximum efficiency can be obtained by 
computing with K, P. This process is exceedingly tedious, and is omitted here. 

3.4 Optimizing AIC, GCV or GACV criteria 

Given the derivatives of r, D and P the derivatives of the Vg, V* or Va are easily obtained, and the criteria 
can be optimized by Newton type methods. Vg, V* and Va are indefinite over some parts of the smoothing 
parameter space, since they flatten out completely at very high or very low p^. values. In many modeling situations 
such regions are unavoidable, since the optimal pk for a term that is not needed in a model should tend to oo. 
When taking a full Newton approach such indefiniteness is readily identifiable and addressable using an eigen- 
decomposition, HAH^, of the Hessian matrix of V. Following Gill, Murray and Wright (1981) the Hessian is 
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replaced in Newton's method by HAH^ where Ajj = |Aii|. Since the replacement is positive definite, the resulting 
modified Newton direction is guaranteed to be a descent direction. Note that the eigen-decomposition is a trivial 
part of the total computational burden here. Another way of increasing convergence rates is to only optimize 
smoothing parameters for which the corresponding gradient of V is large enough to be treated as unconverged. 
When the converged parameters are optimized at 'working infinity', dropping them from optimization tends to 
improve the quadratic model underlying the Newton update of the remaining parameters. (Parameters re-enter the 
optimization if their corresponding gradient becomes large again.) 

Alternatively, one can work only with first derivatives, and use the quasi-Newton or Newton type algorithms 
built into R routines opt im and n Im, for example (see R Core Development Team, 2006 and Dermis and Schnabel, 
1983). The associated loss of computational speed is smaller than might be expected, as first derivatives are very 
cheap to obtain, and indefiniteness produces some degradation in the convergence rates of the 'pure' Newton 
method. However, as the initial example in the introduction emphasizes, a finite differencing based method will 
not always be rehable when faced with complex models and strong concurvity effects. 

4 Examples 

4.1 Performance in 'straightforward' situations 

A small simulation study was undertaken to illustrate the method's performance in non-problematic situations, 
which should not generate numerical problems and where the data do not display concurvity. The example is 
adapted from one presented in Wahba (1990, section 11.3). 

For each replicate, 400 values for each of 4 covariates, xi, . . . ,X4, were simulated independently from a 
uniform distribution on (0, 1). The covariates were used to produce a scaled Unear predictor of the form fji = 
hixu) + f2{x2i) + /3(X3^), whcrc, = 2 sin(7rx), /2(x) = exp(2a;) and ^(a:) = x^^{W{l - ,t)}V5 + 

10'*x^(l — Response data, y^, were then generated under one of 4 models, (i) Independent Bernoulli random 
deviates were generated, taking the value 1 with probability e''*/(l+e'''), where rji = — 5)/2.5; (ii) independent 
Poisson random deviates were generated with mean exp(77i), where r]i = Jji/T; (iii) independent gamma random 
deviates were generated with mean ex.p{rii), with jjj = fii/7 (and scale parameter 1); (iv) independent Gaussian 
random deviates were generated from N{r]i, Arji ) truncated (below) at zero, where rji = exp {fji/Q). 

To each repUcate a 4 term generalized additive model 

9{E:{yi)} = fi{xii) + h{xii) + h{x2,i) + h{xAi) 
was fitted. The Unk function, g, was the logit for the binary data, and log for the other cases. The /j were repre- 
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sented using rank 10 thin plate regression splines (Wood, 2003). The correct distribution was assumed for response 
models (i) to (iii), and for (iv) a quasi-likelihood approach was taken, with the variance assumed proportional to 
the mean. Each model was estimated by 5 alternative methods, (a) By the method presented in this paper using full 
first and second derivative information; (b) using the first derivative scheme presented here to optimize GCV/AIC 
scores using the 'nlm' routine from R (this seems to be the fastest and most reliabale R general purpose optimizer 
for this problem); (c) optimizing the GCV/AIC scores using finite difference based gradients with R optimizer 
'nlm'; (d) using Gu's (1992) performance oriented iteration as implemented in Wood (2004); (e) representing the 
model as a mixed model and estimating via Breslow and Clayton's (1993) PQL (using the 'nlme' library as the un- 
derlying mixed model fitter; Pinheiro and Bates, 2000). For methods a-d AlC was used for the binary and Poisson 
cases, and GCV for the other two. GACV was also tried but had marginally worse MSE/ predictive deviance than 
GCV, and is therefore not reported here. 

To measure model fit, 10000 new data were generated from the model concerned, and the fitted model was 
used to predict the (expected value of the) response variable. The prediction error was measured using the mean 
deviance of the prediction of the 10000 simulated response data minus the mean predictive deviance using the 
known truth. This predictive deviance loss is therefor zero if the model exactly reproduces the truth. In the quasi 
case the error model is incorrect, so the predictive deviance is not such a natural measure of performance and the 
mean square error in predicting the (covariate conditional) means over 10000 independent replicate data was used 
as the prediction error measure (although in fact the conclusions are no different if predictive deviance is used). 

The PQL iteration failed in 20, 8, 13 and 23 out of 200 replicates for the binary, Poisson, gamma and quasi 
models, respectively, while the POl failed to converge in 13, 5, 5 and 10 out of 200 replicates. The smaller failure 
rate for performance oriented iteration as opposed to PQL may reflect the fact that the penalized least squares 
estimator used for POI was specifically designed for use in this application (Wood, 2004). The new method 
converged successfully for all replicates. The nlm based methods produced warnings of potential convergence 
problems in about 2% of cases, but none of these in fact appear to be real failures: rather the warnings seem to be 
triggered by the indefinite nature of the smoothness objective. Failures are of course excluded from the reported 
predictive deviance (MSE) comparisons and timings. 

Time, in CPU seconds, to fit each replicate model was also recorded (on a Pentium M 2. 1 3Ghz processor with a 
Linux operating system). The results are summarized in figure[3] The new method clearly has lower computational 
cost than all the other methods apart from performance oriented iteration, although it turns out that, for each error 
model, the mean time required for the new method is actually less than that required by performance oriented 
iteration, as a result of the skew in the latter's timing distributions. 
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Figure 3: Boxplots of the distribution of the log]^Q(CPU seconds) used to fit the models to the simulated data in 
section \4~T\ 'new' is the new method; 'nlm' is the new method, but using only first derivatives; 'nim.fd' is the 
same optimization performed without derivative information; 'PI' is performance oriented iteration; 'PQL' is for 
the GAM estimated as a mixed model. The skew in the PI timing distributions means that the new method has the 
lowest mean time for all 4 models. The new method is also the most rehable — see text. 

Figure |4] summarizes the predictive performance of the various estimation methods for the 4 types of model. A 
'(-)' after the label for a method indicates that the method was significantly worse than the new method in a paired 
comparison using a Wilcoxon signed rank test (using the .05 level); a '(+)' indicates a significant improvement 
over the new method. The only operationally significant differences are the worse performance of performance 
oriented iteration in the gamma and quasi cases, the worse performance of PQL in the Poisson case, and the better 
performance of PQL in the quasi case, but even these differences are rather modest. Note that the PQL failures 
were mostly for replicates producing quite high MSB or PD loss results by the other methods (which needs to be 
born in mind when viewing figured) So the new method appears to greatly improve speed and reliability without 
sacrificing prediction performance. 



4.2 Generalized Additive Mixed Models 

The same argument that allows a GAM to be estimated as a Generalized Linear Mixed Model using PQL implies 
that many GLMMs can be estimated by the method developed in this paper. To illustrate this the simulations in the 
previous section were modified by splitting the data into 40 groups of size 10, and redefining the unsealed Unear 
predictor as fji = fiixu) + f2{x2i) + fsixsi + bj if observation i is from group j. The bj are i.i.d. N{0, 2^) 
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Figure 4: Boxplots of prediction loss measures for the 4 models used in the section 14.11 simulations, for each of 
5 fitting methods. Labels are as in figure [3] The improved speed and reliability is not at the price of prediction 
performance. 

random deviates. The models fitted to each replicate were modified to GAMMs with linear predictors, 

9{E{yi)} = fi{xii) + f2{x2i) + fsixai) + f4{x4i) + bj if i from group j. 

where the bj are assumed i.i.d. N{0, a^). The simulations were otherwise un-modified except that only the new 
method and PQL were compared. For the new method the random effects are treated just like a smooth with the 
identity matrix as the penalty coefficient matrix, and the associated smoothing parameter controlling a^. 

The mean square error in predicting rji with the bj set to zero, was used as the measure of model performance, 
and the number of CPU seconds needed for model estimation was recorded. Out of 200 replicates PQL failed in 
22, 12, 16 and 12 replicates for the binary, Poisson, gamma and quasi cases, respectively. The new method did 
not fail. Timings and MSE performances are shown in figure |5] Wilcoxon tests (paired) fail to detect significant 
differences between the MSE performance of the methods {p > .4), except in the quasi case (p < 10^^), where 
penalized quasi likelihood is significantly better than the new method, perhaps unsurprisingly. Operationally, the 
MSE differences seem to be small, while the improvements of the new method in terms of speed and reliability are 
substantial. 



4.3 Severe concurvity 

Using R 2.4.0, data were simulated with severe concurvity problems. 
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Figure 5: Upper: Boxplots of the distribution of the log]^Q(CPU seconds) used to fit the models to the simulated 
data in section 14.21 In the labels p, b, q and g refer to the Poisson, binomial, quasi and gamma distributional 
assumptions. Lower: Boxplots of MSE^^ for the various GAMM fits to simulated data discussed in section |4!2] 
The new method is faster and more reliable at little or no performance cost. 

set . seed (23 ) ; n <- 400;x <- runif(n);z <- runif (n) 

d <- x"3 + rnorin(n) *0.01;f <- (d-.5 + 10* (d- . 5) "3) *10 

g<-binomial ( ) $linkinv (f ) ; y <- rbinom (g, 1 , g) 

See figure |2] These data are rather extreme, but concurvity problems of this sort are not uncommon in real data 
examples with many predictors, although they are usually less obvious. The advantage of a simple simulated 
example is that the root cause of the associated fitting problems is clear, while the 'right answer' is known. 

The data were modeled using as described in the introduction, and existing methods fail to provide sat- 
isfactory fits, if they produce a fit at all. The new method converged to an estimated model which substantially 
suppressed /i (its effective degrees of freedom were reduced to .8) while doing a reasonable job at estimating 
/2. The estimated model components are shown in figure |6] Following Wood (2004) the performance oriented 
iteration can be made convergent by regularizing the working penalized linear models at each iterate. However 
the results are very sensitive to the exact degree of regularization performed, with only a narrow window between 
convergence failure and gross oversmoothing. This is clearly unsatisfactory. 

In replicate simulations of this sort the new method is persistently more reliable than the alternatives, although 
there are of course a substantial number of replicates where all methods perform reasonably (the given replicate is 
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Figure 6: The estimates of /i (left) and /2 (right, estimates and 95% confidence limits) from (|2]i, obtained using 
the new method, /i has been almost shrunk to zero. The right hand figure also shows the true /2 as a thick black 
curve (centered in the same way as the smooth). Previous methods fail for this example. 

unusual in that all the alternatives perform badly). No replicates where found where the new method performed 
less well than the alternatives. 

4.4 A fisheries survey 

This final example concerns modelling of fish egg data from a survey of mackerel eggs conducted in 1992 off the 
west coast of the British Isles and France. The purpose of the survey is to assess the abundance of fish eggs in order 
to infer the total mass of spawning fish producing them. The data consist of egg counts from samples collected 
by hauling a sampling net through the water column from below the maximum depth at which eggs are found, 
to the surface, and are shown in figure|7] Along with egg densities, egg, the covariates long, lat, b . depth, 
c . dist, temp . surf and temp . 2 0m were recorded, these being longitude, latitude, depth of sea bed below 
surface, distance from the 200m sea bed depth contour (a proxy for distance from the continental shelf edge), 
surface water temperature and water temperature at 20 m depth, respectively. In addition the area of the sampling 
net was recorded. There are 634 egg counts spread over the survey area. See Borchers et al. (1997) or Bowman 
and Azzalini (1997) for further details. 

This survey also formed the basis for the presence absence data in figure [T] used to illustrate convergence 
failure in the introduction. Unlike previous methods, the new method successfully fits ([T]), identifying a genuine 
'minimum AIC model (i.e. the AIC has zero gradient and positive definite Hessian at the optimum). One can 
make the same point by modelling presence absence over the whole survey area, but given the spatial distribution 
of presences such a model is not practically defensible. 

Turning to the modelling of egg densities (and neglecting any zero inflation problems), a reasonable initial 
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Figure 7: The raw mackerel data (left, symbol area proportional to egg density) and the non-zero estimated terms 
from the mackerel egg model of section l4!4l The central figure shows the spatial smooth over the survey region. 
The right hand figures show the estimated smooths of square root of sea bed depth and water temperature at 20 
metres depth. PQL and performance oriented iteration fail for this example. 

model for the data is 

log{i;(egg^)} = /i(long., lat,) + /2(^b.depthJ + /3(c.distj) + /4(temp.surf J 

+ /5 (temp. 20m-) + log(net.areai) 

along with the 'quasi-Poisson' assumption var(eggj) cx E{egg^) and the assumption that the response variable 
is independent (conditional on the covariates). /i, . . . , /s can be represented using penalized thin plate regres- 
sion splines with shrinkage (see Wood, 2006), employing basis dimensions of 100 for /i and 10 for each of the 
remaining terms. 

Attempts to fit this model using performance oriented iteration fail, without extra regularization: the iteration 
cycles without ever converging. PQL is no more successful: it diverges until the routine for estimating the working 
linear mixed model fails. In contrast the new method fits the model without difficulty. The raw fit shows signs of 
overfitting (informal significance measures for several terms indicate that they have no real effect on the response, 
despite having fairly high estimated effective degrees of freedom). For this reason the model was re-fitted with 
7 — 1.4 in the GCV score (see Kim and Gu, 2004). Two model terms were then estimated to have zero effective 
degrees of freedom (i.e. were penalized out of the model). The remaining terms are shown in figure|2] 

The difficulties in estimating the model by performance oriented iteration or PQL are again likely to relate to 
concurvity issues: all the covariates are functions of spatial location, some of them quite smooth functions. In 
addition the data contain 265 zeroes, and over half the counts are or 1 . At these very low counts the assump- 
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tions underlying PQL are likely to be somewhat poor, while the linearized problem used in performance oriented 
iteration is unhkely to capture the full model's dependency on smoothing parameters very precisely. 

5 Conclusions 

Relative to PQL or performance oriented iteration the new method offers two substantial advantages for GAM (or 
GAMM) estimation and smoothness selection. 

1. It is more computationally reliable. Since smoothing parameter is based on optimizing a properly defined 
function, fitting does not suffer from the convergence problems suffered by PQL or performance oriented 
iteration. 

2. The value of the optimized smoothness selection criteria (GCV/AIC) is useful for model comparisons, since 
it relates to the model being fitted, rather than to some working approximation as is the case for PQL or POI. 

In addition the new method is much quicker than PQL, and competitive with performance oriented iteration (in 
simulations the median cost of the new method is higher while the mean cost is lower). Another less obvious 
benefit of the new approach is that it integrates easily with step reduction procedures for stabilizing the P-IRLS 
algorithm if it diverges, as it occasionally does, particularly in the early steps of fitting binary data. Since the 
P-IRLS is run to convergence with fixed smoothing parameters, it is easy to detect divergence — this is not the 
case with performance oriented iteration or PQL, where the smoothing parameters change alongside the parameter 
estimates at each step of the iteration, so that any possible measure of fit may legitimately increase or decrease 
from one iteration step to the next. The disadvantage of the new method is the complexity of sections 13.21 13.31 
and associated appendices, with little carrying over from the linear problem. However, this disadvantage is a one 
off. Once the method has been implemented, it is hard to imagine circumstances in which performance oriented 
iteration or a finite differencing based method would be preferable. 

Relative to finite difference based optimization of GCV/AIC scores, the new method offers much improved 
computational speed. In difficult modelling situations it also offers enhanced reliability, by elimination of the finite 
difference approximation error which can lead to false convergence. It is not hard to see why problems might arise 
in finite differencing. The quantities being differentiated are the converged state of an iterative algorithm, which 
has to adaptively cope with ill-conditioning problems. Unless very elaborate finite difference schemes are applied 
there is always a danger that the values that get differenced result from different numbers of steps of the P-IRLS, or 
have had different levels of truncation applied to cope with ill-conditioning: either case can easily cause the finite 
difference approximation to fail even to get the sign of the derivative right. The new method eliminates this issue. 
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An obvious alternative to section[3]would be to use auto-differentiation to automatically accumulate derivatives 
of the smoothness criteria directly from the computer code evaluating the criteria (see Skaug and Fournier, 2006, 
for a good statistically based introduction). However, 'forward mode' auto-differentiation has an operations count 
of the same order as finite differencing making it uncompetitive here, while the alternative 'reverse mode' requires 
storage of every intermediate result in the algorithm being differentiated, which is impractical in the current context. 

How far does the proposed method go towards the aim, stated in the introduction, of making GAM fitting 
with smoothness selection as routine as GLM fitting? The aim is the same as that given in Wood (2004), but 
that paper was restricted to performance oriented iteration, a method for which convergence to any sort of fixed 
point can not be guaranteed (and may have to be forced by ad hoc regularization). By taking the direct approach 
the new method is based on optimizing criteria which have well defined optima for any model. This avoids the 
convergence issue, but replaces it with the problem of how to find the optimum in as efficient and stable a manner as 
possible, something that is made difficult by the additional non-linearities introduced by the direct approach. The 
new method succeeds in providing very efficient direct calculation of the derivatives of the smoothness selection 
criteria, as is evident in the surprising timing results, relative to performance oriented iteration, given in section|4T| 
It is unlikely that further substantial improvements are possible in this regard. As highlighted in the introduction, 
numerical stability is an important and unavoidable issue when working with models as flexible as GAMs, and 
the methods proposed here directly address the rank deficiency that may cause this. The QR approach to the basic 
fitting problem is the most stable method known, while the approach taken to rank determination has performance 
close to the 'gold standard' of SVD (see Golub and van Loan, 1996). Again then, there is no obvious alternative 
that might result in a more stable method. In short, the proposed method achieves the stated aim as closely as is 
likely to be achievable (which seems to be quite close). 

The method described here is implemented in R package mgcv (cran . r-pro ject . org). 
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Appendix A: The derivatives of z 



The z derivative update referred to in section [STI is given here. Note that /i^, rji, zi and Wi are always taken as 
being evaluated at the converged /3. 

Initialization: z, w, and r\ are fixed at their converged values from the P-IRLS, but all their derivatives w.r.t. p 
are initially set to zero. The initial derivatives of /3 w.rt. p are as in section [TTl At the converged estimate of yn, 
evaluate the constants: cu = {Vi - fii)g"{ni)/g'{ni), C2i = [{Vi - f^i){9"'{f^i)/9'{tJ'i)'^ - g" ifJ-i)'^ / g' if^i)^} - 
g''{^l,)/g'{^i,nc3^^wUV'{^l^)g'{^l^) + 2Vi^i,)g''i^i,)}/{2^^^^ 
3g"{fi,)V'{fi,)}/{2u;,g'{fi,)}. 

Update: The following steps update the z derivatives given the (3 derivatives (for all k, m, such that k > m). 



1. Evaluate 



dpk dpk dpkdpni dpkdpra' 

2. Update the derivatives of z: 

— Cu- — and - — - — = c^- — y C2i 



OPk Opk OpkOprn OPkOPra Opm OPk 

3. Update the derivatives of = uj^^'^V{p,i)~^^'^ /g'{p,i): 

dwi drji d'^Wi 3 dwi dwi d'^r]i dr]i dru 

-C3iT: — and - — - — = — - — C3i- — cu 



dpk dpk dpkdpm w, dpk dpm dpkdpm dpm dpk ' 

4. The derivatives of z' are evaluated: 

dz'i dwi dzi 9^z' d'^Wi dwi dzi dwi dzi d'^Zi 

-Zi + Wi— — and — — - — — — — — Zi + — — — h — 7; h Wi 



dpk dpk dpk dpkdpra dpkdpm dpk dpm dpra dpk dpkdpr, 

Appendix B: Deviance and Pearson statistic derivatives 

The derivatives of the deviance can be obtained as follows. 

dD _^dD dl3j d^D d^D d(3i dfij \ dD d^fij 



ril - V — u _ V V 



dpk ^ dPj dpk dpkdpm ^ y-^ dfijdlSi dp„i dpk J 8(3^ dpkdp„ 
The required derivatives of the deviance w.rt. (3 are 



— ^ = 2 V( 



— — ~ —2 y ijji——, — 7 — ; — T^ij and 



Xij. 
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So, defining c as the vector with elements Ci — —2uji{yi — IJ'i)/{V{iii)g'{fii)}, the vector of first derivatives 
of D w.r.t. the (3j is X^c. Now noting that d^i/d(3i = Xii/g'{^i) and defining 

1 



e,; = 2w,; 



Vi - fJ-i 



{v'{^l,)g'{^l,) + v{^l.)g"{^,.)} 



the second derivative matrix (Hessian) of D is X^diag(ei)X. 

The derivatives of the Pearson statistic, P, are easily obtained by noting that 

iVt - fMf 



i=l ' i=l 

The expression in terms of the iterative weights, pseudodata and Unear predictor makes evaluation of the derivatives 
of P particularly straightforward, since the derivatives of all Wi, Zi and T]i are available directly from the derivative 
iteration. 



Appendix C: Efficient evaluation of the derivatives of tr (A) 

In the following, wherever \/S„i is written it denotes the q x rank(S,„) matrix such that \/^m\/'^m = 
(pivoted Choleski decomposition can be used to find these, see Golub and van Loan, 1996 and Dongarra et al. 
1978). The following list gives the key steps for evaluating each of the different types of term making up the 
second derivatives of tr (A) as given on the RHS of equation ((5). 

1. For tr (TfemA) etc. first form and store diag (A) = diag (KK^) and the term follows. 

2. tr(TfeAT„A) = tr ([K^TfcK] [K^T^K]) = tr (T,„ATfcA) (the second equality follows from trans- 
posing the matrix expression in the middle trace). This requires storage of K^T^K, in advance. 

3. Terms like tr(ATfe„A) follow from tr(ATfe™ A) =tr(Tfo„AA). So, diag (AA) = diag ([KK^K] [K^]) 
is evaluated once up front (having first formed K^K and then KK^K) and the result is then readily com- 
puted. 

4. K^TfeKK^K is stored up front so that use can be made of 
tr(ATfeAT„A) = tr ([KTTfeK][KTT„KKTK]). 

5. tr (ATfcBTS„B) = tr (^Tfe[KpTVs;;;][Vs;;;"^PKTKKT]^, so evaluate 

diag ^[KP^\/S^ [v^S^^PK^KK^^ and the result is easily obtained. This required up front storage of 

KK^KpT^s;;; and KP^^s;;:. 
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6. Evaluate diag (B'^S^B) = diag (^[KpTv^s;;^[v^^PKT]j and terms like 
tr (TfeB^S^B) follow easily. 

7. Finally, if P'^S^P and P'^S^PR'^K are stored up front, then 

tr (BTS^G-^SfcB) = tr ([pTS^PlpTSfePK^K]) is easily obtained. 

Notice that, if M is the number of smoothing parameters, then by far the most expensive calculation here is the 
evaluation of the M terms K^T^K in step 2. This has a total cost of nq^M/2 floating point operations, which is 
still a considerable saving over finite differencing to get second derivatives. Note also that aU terms in tr (A) and 
its first derivatives are covered in the above list, and have a total leading order computational cost of 0{nq^), the 
same as model estimation: this is an M + 1 fold saving over finite differencing. 
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